Motivation

  • A report of the spatial, temporal analysis, and potential relation to other factors of the distribution of shooting incidents in NYC, also analysis the connection between shooting cases and COVID-19.
  • A website with all the work and results presented.

Background

Although open carry is not directly banned, New York City prohibits the possession of a “loaded” handgun outside of the home or place of business without a carry license. We usually think NYC are more strict on guns than some states and cities, but for the geographic location and political reasons, gun situation in NYC is much more complicated.

New York City has been the site of many Black Lives Matter protests in response to incidents of police brutality and racially motivated violence against black people, especially during George Floyd protests (May–June 2020).

The political issues have a great influence on gun violence as well, for example the change of mayor and changes of crime police.

Besides, COVID-19 literally changed lifestyle of New Yorkers, which contain the frequency and distribution of shooting cases in NYC.

Data

We have mainly used two datasets – NYPD Shooting Incident and COVID-19 Daily Counts of Cases, Hospitalizations, and Deaths, both are from the NYC open data.

NYPD Shooting Incident: Contains information about the incident number, occurrence date, time, the demographic characteristics of the perpetrator and victims geographic location, etc. of the shooting cases happen in NYC during year 2006 to 2021, and was extracted each quarter by the Office of Management Analysis and Planning. It could serve as a great source for government and police to understand some potential nature behind shooting incidents. Each row represents an unique victim, the same incident key may occur several times, meaning the incident may have multiple victims. There are many values missing for perp_age_group, perp_sex and perp_race, which is due to the fact that the information of the perpetrator was unknown or not available at the time of collection. Key variables used in our analysis are:

  • incident_key
  • occur_date, occur_time
  • incident_key
  • statistical_murder_flag
  • perp_age_group, perp_sex, perp_race
  • vic_age_group, vic_sex, vic_race
  • latitude, longitude

COVID-19 Daily Counts of Cases, Hospitalizations, and Deaths: Contains the COVID-related hospitalizations and confirmed, probable death among the whole NYC and each borough. Managed by the NYC Department of Health and Mental Hygiene Incident Command System for COVID-19 Response and is updated everyday. We clean the data by renaming and changing the types of some variables. In addition, we have added a variable borough to show the borough information more explicitly. Key variables are:

  • month, day, year
  • borough
  • borough_case_count
  • total_case_count

Initial Questions

  • What is the specific time period during the day and the year when shooting crimes are more likely to occur?
  • Are there any boroughs or precincts with a significantly higher chance of shootings?
  • What are the perpetrators’ and victims’ demographic characteristics?
  • Whether there is an association between the number of shooting cases and COVID-19?
  • Whether the occurrence of shooting incidents is related to the number of COVID-19 cases in a certain borough?

Exploratory Analysis

Data Import and Cleaning

library(tidyverse)
library(rvest)
library(knitr)
library(leaflet)
library(rgdal)
library(lubridate)
library(plotly)
library(modelr)
col1 = "#d8e1cf" 
col2 = "#438484"
theme_set(theme_minimal() + theme(legend.position = "bottom"))
options(
  ggplot2.continuous.colour = "viridis",
  ggplot2.continuous.fill = "viridis"
)
scale_colour_discrete = scale_colour_viridis_d
scale_fill_discrete = scale_fill_viridis_d

Load the historic and year-to-datedatasets of NYPD shooting incident

shooting_initial = 
  read_csv("./data/NYPD_Shooting.csv") %>% janitor::clean_names()
shooting_2021 = read_csv("./data/NYPD_shooting_New.csv") %>% janitor::clean_names()

#A variable name in shooting_new is different from the initial data, change column name in order to merge the data frames
shooting_2021 = shooting_2021 %>% 
  rename(lon_lat = new_georeferenced_column)

shooting = rbind(shooting_initial, shooting_2021)

check null

shooting %>%
  summarise_all(~ sum(is.na(.)))
## # A tibble: 1 × 19
##   incident_key occur_date occur_time  boro precinct jurisdiction_code
##          <int>      <int>      <int> <int>    <int>             <int>
## 1            0          0          0     0        0                 3
## # … with 13 more variables: location_desc <int>, statistical_murder_flag <int>,
## #   perp_age_group <int>, perp_sex <int>, perp_race <int>, vic_age_group <int>,
## #   vic_sex <int>, vic_race <int>, x_coord_cd <int>, y_coord_cd <int>,
## #   latitude <int>, longitude <int>, lon_lat <int>

For col boro

shooting = shooting %>% 
  mutate(boro = as.factor(boro)) %>%
  mutate(location_desc = replace_na(location_desc, "NONE")) %>%
  mutate(location_desc = as.factor(location_desc)) %>%
  separate(occur_date, into = c("month", "day", "year")) %>% 
  mutate(month = as.numeric(month)) %>% 
  arrange(year, month) %>% 
  mutate(year = as.character(year)) %>% 
  mutate(boro = tolower(boro)) %>% 
  mutate(boro = if_else(boro == "staten island", "staten_island", boro)) %>% 
  rename(borough = boro) %>% 
    mutate(date = str_c(month, day, year, sep = "/")) %>% 
  select(incident_key, date, everything())

Next, clean the COVID-19 case count data

Importing COVID-19 case count data

covid_counts = read.csv("./data/COVID19_data.csv", sep = ";") %>% as_tibble()

The clean dataset contains only day-by-day COVID-19 case count for each borough and the total case count in NYC of a particular day.

clean_covid = covid_counts %>% 
   janitor::clean_names() %>% 
   rename(date = date_of_interest) %>% 
   select(date, contains("case_count")) %>% 
   select(-contains(c("probable_case_count", "case_count_7day_avg", "all_case_count_7day_avg"))) %>%
   separate(date, into = c("month", "day", "year")) %>% 
   mutate_all(as.character) %>% 
   mutate_if(is.character, gsub, pattern = ",", replacement = "") %>% 
   mutate_if(is.character, as.numeric) %>%
   pivot_longer(
    cols = bx_case_count:si_case_count,
    names_to = "borough",
    values_to = "borough_case_count"
  ) %>% 
  mutate(borough = gsub("_case_count", "", borough)) %>% 
  mutate(borough = recode(borough, "bx" = "bronx","bk" = "brooklyn","mn" = "manhattan","si" = "staten_island","qn" = "queens")) %>% 
  relocate(case_count, .after = borough_case_count) %>% 
  rename(total_case_count = case_count) %>% 
  mutate(date = str_c(month, day, year, sep = "/")) %>% 
  select(date, everything())

Heatmap

shooting_heatmap = shooting_initial %>%
  mutate(occur_date = as.Date(occur_date,'%m/%d/%Y')) %>%
  mutate(occur_date = weekdays(occur_date)) %>%
  separate(occur_time, into = c("hour", "minute", "second")) %>%
  mutate(hour = as.factor(hour)) %>%
  select(incident_key, occur_date, hour) %>%
  mutate(occur_date = as.factor(occur_date),
         occur_date = fct_relevel(occur_date, "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday" , "Saturday"))

dayHour = plyr::ddply(shooting_heatmap, c( "hour", "occur_date"), summarise, N = length(incident_key))
attach(dayHour)

heatmap = ggplot(dayHour, aes(hour, occur_date)) + 
  geom_tile(aes(fill = N),colour = "white", na.rm = TRUE) +
  scale_fill_gradient(low = col1, high = col2) +  
  guides(fill = guide_legend(title = "Total Shooting Cases")) +
  theme_bw() + 
  theme_minimal() + 
  labs(title = "Time Based Heatmap",
       x = "Shooting Cases Per Hour", y = "Day of Week") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

heatmap

According to the result of this heatmap, the midnight of weekends(Sunday and Saturday) have the highest risk of shooting cases and the total shooting cases even reach 500 for an hour. Additionally, daytime between 7 in the morning and 19 in the evening seems to have lower shooting cases than the other time of the day.

Is this situation happened in every borough?

Let’s have a look!

In Brooklyn

It is very similar to the heatmap above, and the total number of shooting cases is very high as well, which means Brooklyn is one of the main shooting happened place in NYC and it kind of a representative shooting incidents distribution in NYC.

heatmap_bn = shooting_initial %>%
  filter(boro == "BROOKLYN") %>%
  mutate(occur_date = as.Date(occur_date,'%m/%d/%Y')) %>%
  mutate(occur_date = weekdays(occur_date)) %>%
  separate(occur_time, into = c("hour", "minute", "second")) %>%
  mutate(hour = as.factor(hour)) %>%
  select(incident_key, occur_date, hour) %>%
  mutate(occur_date = as.factor(occur_date),
         occur_date = fct_relevel(occur_date, "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday" , "Saturday"))

dayHour = plyr::ddply(heatmap_bn, c( "hour", "occur_date"), summarise, N = length(incident_key))
attach(dayHour)

heatmap_Bn = ggplot(dayHour, aes(hour, occur_date)) + 
  geom_tile(aes(fill = N),colour = "white", na.rm = TRUE) +
  scale_fill_gradient(low = col1, high = col2) +  
  guides(fill = guide_legend(title = "Total Shooting Cases")) +
  theme_bw() + 
  theme_minimal() + 
  labs(title = "Time Based Heatmap in Brooklyn",
       x = "Shooting Cases Per Hour", y = "Day of Week") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

heatmap_Bn

In Bronx

We can see that there is a slight difference between Brooklyn and the whole New York City, which is the total shooting cases decrease apparently. The highest case is 150 but in the total heatmap reach 500. Weekends(Sunday and Saturday) midnight still have the highest risk of shooting cases.

heatmap_bx = shooting_initial %>%
  filter(boro == "BRONX") %>%
  mutate(occur_date = as.Date(occur_date,'%m/%d/%Y')) %>%
  mutate(occur_date = weekdays(occur_date)) %>%
  separate(occur_time, into = c("hour", "minute", "second")) %>%
  mutate(hour = as.factor(hour)) %>%
  select(incident_key, occur_date, hour) %>%
  mutate(occur_date = as.factor(occur_date),
         occur_date = fct_relevel(occur_date, "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday" , "Saturday"))

dayHour = plyr::ddply(heatmap_bx, c( "hour", "occur_date"), summarise, N = length(incident_key))
attach(dayHour)

heatmap_Bx = ggplot(dayHour, aes(hour, occur_date)) + 
  geom_tile(aes(fill = N),colour = "white", na.rm = TRUE) +
  scale_fill_gradient(low = col1, high = col2) +  
  guides(fill = guide_legend(title = "Total Shooting Cases")) +
  theme_bw() + 
  theme_minimal() + 
  labs(title = "Time Based Heatmap in Bronx",
       x = "Shooting Cases Per Hour", y = "Day of Week") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

heatmap_Bx

In Queens

The distribution of shooting incidents is similiar but the number of cases is below 100 in queens. Also, you can have morning jog on Wednesday and Tuesday without any stress at 6(just kiding).

heatmap_q = shooting_initial %>%
  filter(boro == "QUEENS") %>%
  mutate(occur_date = as.Date(occur_date,'%m/%d/%Y')) %>%
  mutate(occur_date = weekdays(occur_date)) %>%
  separate(occur_time, into = c("hour", "minute", "second")) %>%
  mutate(hour = as.factor(hour)) %>%
  select(incident_key, occur_date, hour) %>%
  mutate(occur_date = as.factor(occur_date),
         occur_date = fct_relevel(occur_date, "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday" , "Saturday"))

dayHour = plyr::ddply(heatmap_q, c( "hour", "occur_date"), summarise, N = length(incident_key))
attach(dayHour)

heatmap_q = ggplot(dayHour, aes(hour, occur_date)) + 
  geom_tile(aes(fill = N),colour = "white", na.rm = TRUE) +
  scale_fill_gradient(low = col1, high = col2) +  
  guides(fill = guide_legend(title = "Total Shooting Cases")) +
  theme_bw() + 
  theme_minimal() + 
  labs(title = "Time Based Heatmap in Queens",
       x = "Shooting Cases Per Hour", y = "Day of Week") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

heatmap_q

In Manhattan

Weekends still have higher risk and the highest total shooting cases drop to 80.

heatmap_m = shooting_initial %>%
  filter(boro == "MANHATTAN") %>%
  mutate(occur_date = as.Date(occur_date,'%m/%d/%Y')) %>%
  mutate(occur_date = weekdays(occur_date)) %>%
  separate(occur_time, into = c("hour", "minute", "second")) %>%
  mutate(hour = as.factor(hour)) %>%
  select(incident_key, occur_date, hour) %>%
  mutate(occur_date = as.factor(occur_date),
         occur_date = fct_relevel(occur_date, "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday" , "Saturday"))

dayHour = plyr::ddply(heatmap_m, c( "hour", "occur_date"), summarise, N = length(incident_key))
attach(dayHour)

heatmap_m = ggplot(dayHour, aes(hour, occur_date)) + 
  geom_tile(aes(fill = N),colour = "white", na.rm = TRUE) +
  scale_fill_gradient(low = col1, high = col2) +  
  guides(fill = guide_legend(title = "Total Shooting Cases")) +
  theme_bw() + 
  theme_minimal() + 
  labs(title = "Time Based Heatmap in Manhattan",
       x = "Shooting Cases Per Hour", y = "Day of Week") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

heatmap_m

In Staten_island

Obviously, this is the safest borough considering shooting cases. The highest total shooting number is 25. The highest risk time is 3 to 4 a.m. on Saturday. Monday evening at 19 is high risk as well except weekends.

heatmap_l = shooting_initial %>%
  filter(boro == "STATEN ISLAND") %>%
  mutate(occur_date = as.Date(occur_date,'%m/%d/%Y')) %>%
  mutate(occur_date = weekdays(occur_date)) %>%
  separate(occur_time, into = c("hour", "minute", "second")) %>%
  mutate(hour = as.factor(hour)) %>%
  select(incident_key, occur_date, hour) %>%
  mutate(occur_date = as.factor(occur_date),
         occur_date = fct_relevel(occur_date, "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday" , "Saturday"))

dayHour = plyr::ddply(heatmap_l, c( "hour", "occur_date"), summarise, N = length(incident_key))
attach(dayHour)

heatmap_l = ggplot(dayHour, aes(hour, occur_date)) + 
  geom_tile(aes(fill = N),colour = "white", na.rm = TRUE) +
  scale_fill_gradient(low = col1, high = col2) +  
  guides(fill = guide_legend(title = "Total Shooting Cases")) +
  theme_bw() + 
  theme_minimal() + 
  labs(title = "Time Based Heatmap in Staten Island",
       x = "Shooting Cases Per Hour", y = "Day of Week") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

heatmap_l

Shooting Incidents Across Time

For the shooting incidents across time, we use three different levels to analyze. Firstly, we compared shooting case year by year. ### Distribution of Shooting Case of Years

shooting_year = shooting %>%
  group_by(year) %>% 
  summarise(n_obs = n())
#visualization shooting incidence trend
shooting_year %>% 
  plot_ly( x = ~year, y = ~n_obs, type = "scatter", mode = "lines+markers") %>% 
  layout(title = "Shooting Incidence Trend from 2006 to 2021",
         xaxis = list(title = "Year"),
         yaxis = list(title = "Frequency"))

By observing the data set, the shooting incidence gradually decrease from 2055 cases in 2006 to 967 cases in 2019. However, due to the Covid-19 pandemic and responses to large-scale protests over the killing of George Floyd, there is a sharply surge of shooting incidents in 2020 which have 1948 cases. Since the data for 2021 is only from January to September 30th, we are not sure whether there is a decrease in the year 2021 compared to 2020.

Distribution of Shooting Case of Months

Then we take a look of average shooting cases between months from 2006 to 2021 in the New York City.

shooting_month = shooting %>%
  mutate(month = as.factor(month)) %>% 
  group_by(month) %>% 
  summarise(n_obs = n())

shooting_month %>% 
  plot_ly(x = ~month, y = ~n_obs, color = ~month, type = "bar") %>% 
  layout( title = "The Distribution of Shooting Incidence by Month",
          xaxis = list(title = "Month"),
          yaxis = list(title = "Frequency"))

The distribution of the shooting incidence by month has a bell-shape. The shooting case concentrated in summer from May to September. The reason of this may related to the time of memorial day and the labor day.

Distribution of Shooting Case of Time Range Across day

shooting_time = shooting %>% 
##format the occur_time variable to only hours.
  mutate(occur_time_hour = format(as.POSIXct(occur_time), format = "%H")) %>% 
  mutate(occur_time_hour = as.numeric(occur_time_hour)) %>% 
  group_by(occur_time_hour) %>% 
  summarise(case_numb = n())

#divide day time to 4 groups: 0-6;6-12;12-18;18-24
shooting_time = shooting_time %>% 
  mutate(occur_time_range = case_when(
    occur_time_hour >= 0 & occur_time_hour < 6 ~ "0-6",
    occur_time_hour >= 6 & occur_time_hour < 12 ~ "6-12",
    occur_time_hour >= 12 & occur_time_hour < 18 ~ "12-18",
    occur_time_hour >= 18 & occur_time_hour < 24 ~ "18-24"))
shooting_time = shooting_time %>% 
  mutate(occur_time_range = factor(occur_time_range, levels = c("0-6","6-12","12-18","18-24")))

ggplot(shooting_time, aes(x = occur_time_range, y = case_numb, fill = occur_time_range)) + geom_col(alpha = 1) + labs(x = "Occur Time Range",
       y = "Frequency",
       title = "Distribution of Shooting Case by Day")

#pie chart of ratio of shooting cases in a day
ggplot(shooting_time, aes(x = "", y = case_numb, fill = occur_time_range)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  scale_fill_brewer(palette = "Pastel1") +
  theme_void() + 
  guides(fill = guide_legend(title = "occur Time range")) +
  labs(title = "Pie Chart for Distribution of Shooting Case by Day") +
  theme(legend.position = "right")

From the bar chart, it is obvious that most of the shooting cases happens in the evening and the late night, which concentrated in 18-24 and 0-6. The pie chart clearly show the occupy of shooting cases time range take place in a day.

Distribution of shooting case across month in 2020

We would like to focus the gun violence in the year of 2020 as a critical year of surge. (Since the covid-19 starts from March 2020, to see if there any relation between shooting and Covid.)

shooting_2020 = shooting %>%
  filter(year == "2020") %>% 
  mutate(month = as.factor(month)) %>% 
  group_by(month) %>% 
  summarise(n_obs = n())

ggplot(shooting_2020, aes(x = month, y = n_obs, fill = month)) + geom_col(alpha = 1) + labs(
  x = "Month",
  y = "Frequency",
  title = "Distribution of Shooting Case across month in 2020 in NYC")

The major rise in gun violence in the city began in 2020, after a period in which violent crime dropped to its lowest levels in more than six decades. For the first half of the year, Gun violence is relatively low as the city shut down early in the pandemic. The spike of shooting cases during June to August, which is mainly because of the death of George Floyd.

Since 2020 is the critical year, we would like to analyze the average shooting case by month between year 2019 and 2020.

shooting_2019_2020 = shooting %>%
  filter(year == "2019" | year == "2020") %>% 
  mutate(month = as.factor(month)) %>% 
  group_by(year, month) %>% 
  summarise(frequency = n())

ggplot(shooting_2019_2020, aes(x = month, y = frequency, fill = year)) + geom_bar(stat = "identity", position = position_dodge(), alpha = 0.75)  + labs(
  x = "Month",
  y = "Frequency",
  title = "Shooting Case Across Month in NYC in 2019 & 2020")

According to the plot, besides there is year on year to growth between 2019 to 2020, the distribution of shooting case across month are the same.

Shooting Incidents Across Space

shooting_map = shooting %>% 
  mutate_at(c("perp_age_group", "perp_sex", "perp_race"), funs(ifelse(is.na(.), "UNKNOWN", .))) %>% 
  mutate(labels = str_c("<b>Incident Key: </b>", incident_key, 
                    "<br>", "<b>Date: </b>", date,
                    "<br>", "<b>Borough: </b>", borough,
                    "<br>", "<b>Murdered: </b>", statistical_murder_flag,
                    "<br>", "<b>Perpetrator's Race: </b>", perp_race,
                    "<br>", "<b>Victim's Race: </b>", vic_race,
                    "<br>", "<b>Perpetrator's Age: </b>", perp_age_group,
                    "<br>", "<b>Victim's Age: </b>", vic_age_group
                    ))

nyc_boro = readOGR("./data/Borough_Boundaries/geo_export_2204bc6b-9c17-46ed-8a67-7245a1e15877.shp", layer = "geo_export_2204bc6b-9c17-46ed-8a67-7245a1e15877", verbose = FALSE)
polygon_color <- colorFactor(
  palette = "viridis",
  domain = as.factor(nyc_boro@data$boro_name))

shooting_map %>% 
  leaflet() %>% 
  addTiles() %>% 
  addProviderTiles("CartoDB.Positron") %>% 
  addMarkers(lng = ~longitude, lat = ~latitude, popup = ~labels,
             clusterOptions = markerClusterOptions()) %>% 
  addPolygons(data = nyc_boro,
              weight = 0.85,
              fillColor = ~polygon_color(nyc_boro@data$boro_name),
#              fillOpacity = 0.6,
              color = "#BDBDC3",
              label = ~nyc_boro@data$boro_name)

This is an interactive map of shooting incidents from January 2006 to September 2021 in NYC. Incidents’ details will show up after clicking the icon. Please zoom in and out the map to see the total number of shooting incidents occurred in the area in the window. Borough’s name will appear when hovering over each borough.

Map of Shooting Incidents Before and After COVID-19 in NYC

shooting_map_df = shooting_map %>% 
  filter(year == c(2018,2019,2020,2021)) %>% 
  split(.,~year)


year_map <- leaflet() %>% 
  addTiles() %>% 
  addProviderTiles("CartoDB.Positron") %>% 
  addPolygons(data = nyc_boro,
              weight = 0.85,
              fillColor = ~polygon_color(nyc_boro@data$boro_name),
#              fillOpacity = 0.6,
              color = "#BDBDC3",
              label = ~nyc_boro@data$boro_name)

groupColors  <- colorFactor(
  palette = "viridis",
  domain = as.factor(names(shooting_map_df)))



names(shooting_map_df) %>%
  purrr::walk( function(df) {
    year_map <<- year_map %>%
      addCircleMarkers(data = shooting_map_df[[df]],
                          lng = ~longitude, 
                          lat = ~latitude,
                          popup = ~labels,
                          group = df,
                          radius = 0.3,
                          color =  ~groupColors(df)
#                          clusterOptions = markerClusterOptions(),
                ) 
  })

year_map %>%
  addLayersControl(
    overlayGroups = names(shooting_map_df),
    options = layersControlOptions(collapsed = FALSE)
  )

This map focuses on the shooting incidents from January 2018 to September 2021. Its purpose includes but not limited to observing the change of location and density of shooting incidents happened in the city before and after COVID-19. Year-to-year comparison can be made by selecting different years on the right upper corner.

Map of Shooting Incidents by Perpetrator’s Race/Ethicity

race_map_df = shooting_map %>% 
  filter(year == c(2019,2020,2021)) %>% 
  mutate(perp_race = tolower(perp_race)) %>% 
  split(.,~perp_race)

raceColors  <- colorFactor(
  palette = "viridis",
  domain = as.factor(names(race_map_df)))

race_map <- leaflet() %>% 
  addTiles() %>% 
  addProviderTiles("CartoDB.Positron") %>% 
  addPolygons(data = nyc_boro,
              weight = 0.85,
              fillColor = ~polygon_color(nyc_boro@data$boro_name),
#              fillOpacity = 0.6,
              color = "#BDBDC3",
              label = ~nyc_boro@data$boro_name)

names(race_map_df) %>%
  purrr::walk( function(df) {
    race_map <<- race_map %>%
      addCircleMarkers(data = race_map_df[[df]],
                          lng = ~longitude, 
                          lat = ~latitude,
                          popup = ~labels,
                          group = df,
                          radius = 0.3,
                          color =  ~raceColors(df)
#                          clusterOptions = markerClusterOptions(),
                ) 
  })

race_map %>%
  addLayersControl(
    overlayGroups = names(race_map_df),
    options = layersControlOptions(collapsed = FALSE)
  )

This map shows the location of shooting incidents from January 2019 to September 2021 categorized by the ethnic group of the perpetrator. Selection of Ethnic group for exploration is based on given information from the original data.

Demographic Characteristics

Based on the shooting dataset we obtained, we could study the demographic characteristics of the perpetrators and victims to see whether there exists some potential patterns/correlations or not.

Perpetrators’ Characteristics

First, let’s look at perpetrators’ demographic information.

According to the previously checked numbers of null values in each column, there are 8295 values missing in perp_age_group and 8261 values missing in both perp_sex and perp_race. The information missing maybe not available or unknown at the time of the report and should be considered as either “Unknown/Not Available/Not Reported.”

The same shooting incident may have several victims but corresponding to the same perpetrator. Therefore, for the perpetrators’ characteristics, we filter the data to make sure the incident key is unique to avoid counting the same perpetrators several times.

# group by perp_age_group and do a little summary
sum_perp_age = 
  shooting %>% 
  mutate(
    perp_age_group = ifelse(is.na(perp_age_group), "UNKNOWN", perp_age_group),
    perp_age_group = as.factor(perp_age_group)
  ) %>% 
  # count distinct incident case
  distinct(incident_key, .keep_all = TRUE) %>% 
  group_by(perp_age_group) %>% 
  arrange(perp_age_group) %>%
  relocate(perp_age_group) %>% 
  # filter the abnormal value
  filter(!(perp_age_group %in% c("1020", "224", "940"))) %>% 
  summarize(number = n())

# Visualize by a pie chart
sum_perp_age %>% 
  mutate(perc = scales::percent(number/sum(number))) %>% 
  ggplot(aes(x = "", y = perc, fill = perp_age_group)) +
  geom_bar(width = 1, stat = "identity") +
  geom_text(aes(label = perc),
            position = position_stack(vjust = 0.5)) +
  coord_polar("y", start = 0) +
  scale_fill_brewer(palette = "Pastel1") +
  theme_void() + 
  guides(fill = guide_legend(title = "Perpetrator's Age Group")) +
  labs(title = "Pie Chart for Different Perpetrator Age Groups") +
  theme(legend.position = "right")

Therefore, except those with unknown age group, most perpetrators are in the age group 18-24 and 24-44, which takes up 39.4% in total. We can see that most of them are quite young or in their middle age. The large percentage of unknown values may also because some of the perpetrators are still unknown/uncaught, which leaves a potential threat to the community.

# Now look at perpetrators' sex and race
perp_sex_race = 
  shooting %>% 
  distinct(incident_key, .keep_all = TRUE) %>% 
  mutate(
    perp_sex = ifelse(is.na(perp_sex), "U", perp_sex),
    perp_race = ifelse(is.na(perp_race), "UNKNOWN", perp_race)
  ) %>% 
  mutate(perp_race = fct_infreq(as.factor(perp_race))) %>%
  group_by(perp_sex, perp_race) %>% 
  relocate(perp_sex, perp_race) %>% 
  arrange(perp_sex, perp_race) %>% 
  summarise(num = n()) %>% 
  mutate(perc = round(num/sum(num), digits = 2))

# Get the position for the label of the pie charts
position1 = 
  perp_sex_race %>% 
  filter(perp_sex != "U") %>%
  mutate(csum = rev(cumsum(rev(perc*100))), 
         pos = perc*100/2 + lead(csum, 1),
         pos = if_else(is.na(pos), perc*100/2, pos))

# pie chart
perp_sex_race %>% 
  filter(perp_sex != "U") %>%
  ggplot(aes(x = "" , y = perc*100, fill = fct_inorder(perp_race))) +
  geom_col(width = 1, color = 1) +
  coord_polar(theta = "y") +
  scale_fill_brewer(palette = "Pastel1") +
  ggrepel::geom_label_repel(data = position1,
                   aes(y = pos, label = paste0(perc*100, "%")),
                   size = 4.5, nudge_x = 1, show.legend = FALSE) +
  guides(fill = guide_legend(title = "Perpetrator's Race")) +
  theme_void() +
  labs(title = "Pie Chart For Different Perpetrator Race in Different Sex") +
  facet_grid(.~perp_sex)

# Or we can try to visualize using bar chart
perp_sex_race %>% 
  ggplot(aes(x = perp_race, y = num, fill = perp_sex)) +
  geom_bar(
    stat = "identity", position = position_dodge()
  ) +
  labs(
    x = "Perpetrator's Race",
    y = "Number of Perpetrators",
    title = "Number of Perpetrators in Different Races"
  ) +
  coord_flip() +
  guides(fill = guide_legend(title = "Perpetrator's Sex")) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1), legend.position = "right")

From the perp_sex_race chart and the pie charts, most of the perpetrators are males (98.11%), and two races with highest percentage are Black (M: 74%, F:67%) and White Hispanic(M: 13%, F: 19%). Two races with lowest percentage are American Indian/Alaskan Native (nearly 0% for both male and female perpetrators) and Asian/Pacific Islanders (M: 1%, F: 1%). The race information of perpetrators does not vary a lot in different sex.

Victims’ Characteristics

Another important demographic information in shooting incidents is about the victim. What kind of characteristics do they hold? Is there some people more likely to become the victims in such tragedies?

There is only a very limited amount of missing values in the victims’ demographic characteristics: vic_age_group, vic_sex and vic_race, which makes sense cause it’s easier to gather the victims’ information compared to the criminals in shootings.

What age groups are most victims in?

vic_age_group =
  shooting %>% 
  mutate(as.factor(vic_age_group)) %>%
  group_by(vic_age_group) %>% 
  relocate(vic_age_group) %>% 
  arrange(vic_age_group) %>% 
  summarise(num = n())

vic_age_group %>% 
  mutate(perc = scales::percent(num/sum(num))) %>% 
  ggplot(aes(x = "", y = perc, fill = vic_age_group)) +
  geom_bar(width = 1, stat = "identity") +
  geom_text(aes(label = perc),
            position = position_stack(vjust = 0.5)) +
  coord_polar("y", start = 0) +
  scale_fill_brewer(palette = "Pastel1") +
  theme_void() + 
  guides(fill = guide_legend(title = "Victim's Age Group")) +
  theme(legend.position = "right") +
  labs(
    x = "Victim's Age Group",
    y = "Number of Victims",
    title = "Number of Victims in Different Age Groups"
  )

As shown above, most victims are in the age group 18-24 and 25-44.

For victims’ sex and race, similarly, first check the bar plot and pie chart.

# The sex and race characteristics of victims
vic_sex_race = 
  shooting %>% 
  mutate(vic_race = fct_infreq(as.factor(vic_race))) %>% 
  group_by(vic_sex, vic_race) %>% 
  relocate(vic_sex, vic_race) %>% 
  arrange(vic_sex, vic_race) %>% 
  summarise(num = n()) %>% 
  mutate(perc = round(num/sum(num), digits = 2))

# bar plot
vic_sex_race %>% 
  ggplot(aes(x = vic_race, y = num, fill = vic_sex)) +
  geom_bar(
    stat = "identity", position = position_dodge()
  ) +
  labs(
    x = "Victim's Race",
    y = "Number of Victims",
    title = "Number of Victims in Different Races"
  ) +
  coord_flip() +
  guides(fill = guide_legend(title = "Victim's Sex")) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1), legend.position = "right")

# To see the percentage of different race, we can draw a pie chart

# Let's get the position for the label of the pie charts first
position2 = 
  vic_sex_race %>% 
  filter(vic_sex != "U") %>%
  mutate(csum = rev(cumsum(rev(perc*100))), 
         pos = perc*100/2 + lead(csum, 1),
         pos = if_else(is.na(pos), perc*100/2, pos))

# pie chart
vic_sex_race %>% 
  filter(vic_sex != "U") %>%
  ggplot(aes(x = "" , y = perc*100, fill = fct_inorder(vic_race))) +
  geom_col(width = 1, color = 1) +
  coord_polar(theta = "y") +
  scale_fill_brewer(palette = "Pastel1") +
  ggrepel::geom_label_repel(data = position2,
                   aes(y = pos, label = paste0(perc*100, "%")),
                   size = 4.5, nudge_x = 1, show.legend = FALSE) +
  guides(fill = guide_legend(title = "Victim's Race")) +
  theme_void() +
  labs(title = "Pie Chart for Different Victim Race in Different Sex") +
  facet_grid(.~vic_sex)

Most of the victims are Black (M: 72%, F: 69%), and the number of male victims in shooting cases is much more than female victims.

COVID and Shooting Incidence

Note: The date of shooting and COVID datasets only overlap on February-December 2020, so we do the analysis base on the overlap.

shooting_mini =
  shooting %>% 
  filter(year == "2020") %>% 
  select(c("date", "incident_key", "borough"))

shooting_covid = 
  merge(x = shooting_mini, y = clean_covid, by = c("date", "borough")) %>% 
  relocate("date", "month", "day", "year", everything()) %>% 
  group_by(date) %>% 
  add_count(borough, name = "borough_n_victim") %>% # victim number equals to the count of incident_key (includind duplicate)
  distinct() %>% 
  add_count(borough, name = "borough_n_shooting") %>% # shooting number equals to the count of distinct incident_key
  select(-incident_key) %>% 
  distinct() %>% 
  add_count(date, wt = borough_n_victim, name = "total_n_victim") %>% 
  add_count(date, wt = borough_n_shooting, name = "total_n_shooting") %>% 
  mutate(
    borough = recode(borough, 
                     "bronx" = "Bronx",
                     "brooklyn" = "Brooklyn",
                     "manhattan" = "Manhattan",
                     "queens" = "Queens",
                     "staten_island" = "Staten Island")
  )

Visualize Distribution

First we want to see if there is an underlying association between total COVID cases and total shootings.

shooting_covid %>% 
  ggplot(aes(x = total_case_count, y = total_n_shooting, group = borough, color = borough)) +
  geom_point() +
  labs(
    title = "Distribution of Shooting Incident Against COVID Cases Within Each Day",
    x = "Total COVID Cases",
    y = "Total Shooting Incident"
  ) +
  guides(fill = guide_legend(title = "Borough")) +
  theme(legend.position = "right")

See separately for each borough.

shooting_covid %>% 
  ggplot(aes(x = borough_case_count, y = borough_n_shooting, group = borough, color = borough)) +
  geom_point() +
  labs(
    title = "Distribution of Shooting Incident Against COVID Cases Within Each Day",
    x = "Total COVID Cases",
    y = "Total Shooting Incident"
  ) + 
  facet_wrap(~ borough) +
  guides(fill = guide_legend(title = "Borough")) +
  theme(legend.position = "right")

We can see that for each borough, there is a general trend that higher number of shooting cases is associated with fewer COVID cases, whereas the number of shooting cases decreases when COVID case increases. This trend does not manifest in Staten Island, which may due to the limited size of data.

Then explore by month.

shooting_covid %>% 
  ggplot(aes(x = total_case_count, y = total_n_shooting, group = month, color = month)) +
  geom_point() +
  labs(
    title = "Distribution of Shooting Incident Against COVID Cases Within Each Day",
    x = "Total COVID Cases",
    y = "Total Shooting Incident"
  ) +
  facet_wrap(~ month) +
  guides(fill = guide_legend(title = "Month")) +
  theme(legend.position = "right")

We can see from the above plot that there is no general pattern of the association between total shooting and total COVID cases. But for the months in the middle of the year, especially in June, July and August, we can see that the number shooting cases is high when there are few COVID cases, which is consistent with the former analysis.

K-means Clusters

Based on the scatter plots above, we want to see whether there are specific groups within a certain range of COVID cases and shooting incidents. Therefore, we use K-means clustering to identify clusters.

kmeans_df = 
  shooting_covid %>% 
  ungroup() %>% 
  select(total_case_count, total_n_shooting)

kmeans_fit = 
  kmeans(x = kmeans_df, centers = 3)

kmeans_df %>% 
  broom::augment(kmeans_fit, .) %>%
  ggplot(aes(x = total_case_count, y = total_n_shooting, color = .cluster)) +
  geom_point() +
  labs(
    title = "K-means Clustering on Shooting Incident",
    x = "Total COVID Cases",
    y = "Total Shooting Incident"
  ) +
  guides(fill = guide_legend(title = "Cluster")) +
  theme(legend.position = "right")

We can see that the data is separated into three groups, among which the first group has a small COVID case number and a wide range of shooting incidents, the second group has medium values for both metric, and the third group has the largest COVID case number as well as relatively small shooting incidents. These groups might indicate the association between shooting and COVID varies by the seriousness of COVID status.

clusts =
  tibble(k = 2:4) %>%
  mutate(
    km_fit =    map(k, ~kmeans(kmeans_df, .x)),
    augmented = map(km_fit, ~broom::augment(.x, kmeans_df))
  )

clusts %>% 
  select(-km_fit) %>% 
  unnest(augmented) %>% 
  ggplot(aes(x = total_case_count, y = total_n_shooting, color = .cluster)) +
  geom_point(aes(color = .cluster)) +
  facet_grid(~k) +
    labs(
    title = "K-means Clustering on Shooting Incident by Cluster Number",
    x = "Total COVID Cases",
    y = "Total Shooting Incident"
  ) +
  guides(fill = guide_legend(title = "Cluster")) +
  theme(legend.position = "right")

The plots above shows k-means clusters with different group numbers, we can see that as the number of group increases, the range of both metrics shrinks by gradient.

Additional Analysis

Regression Analysis

Fit a linear regression model for total COVID case with total shooting as predictor.

total_lm = lm(total_n_shooting ~ total_case_count, data = shooting_covid) 

Model Diagnosis

broom::tidy(total_lm) %>% 
  knitr::kable()
term estimate std.error statistic p.value
(Intercept) 7.1159313 0.1860307 38.25137 0
total_case_count -0.0011568 0.0001142 -10.13164 0
set.seed(100)
par(mfrow = c(2,2))
plot(total_lm)

We can see from the Residuals vs Fitted plot that the residuals is not equally distributed around the 0 horizontal line. In fact, the residuals has a pattern of decrease, indicating that the model has high goodness of fit, but violates the assumption of the normal distribution of the residual. Besides, from the Normal Q-Q plot we can see that most of the data fit the line when the theoretical quantile is under 2, indicating the data might have a normal distribution in a certain range.

Regression Analysis

The goal of this section was to find whether there is any relationship between the number of COVID-19 case with the number of shooting case within different borough. We combined our data resources and examine the following candidate predictors for the outcome of COVID-19 case rate: the number of shooting case, the number of shooting case and Covid cases in different borough.

Model selection

Fit a linear regression model for total shooting case with total COVID case as predictor.

original_lm = lm(total_case_count ~ total_n_shooting + borough + borough_case_count + borough_n_shooting + borough_n_victim, data = shooting_covid) 

broom::tidy(original_lm) %>% 
  knitr::kable()
term estimate std.error statistic p.value
(Intercept) 338.54947 53.8712414 6.284419 0.0000000
total_n_shooting -22.84892 6.6417297 -3.440206 0.0006272
boroughBrooklyn -485.55153 49.9741152 -9.716061 0.0000000
boroughManhattan 184.94922 55.4239069 3.336994 0.0009064
boroughQueens -358.40454 54.9730146 -6.519645 0.0000000
boroughStaten Island 480.09819 85.4432598 5.618912 0.0000000
borough_case_count 3.86830 0.0659028 58.697050 0.0000000
borough_n_shooting 65.30967 29.8450194 2.188294 0.0290848
borough_n_victim -19.26937 17.0724084 -1.128685 0.2595424
set.seed(100)
par(mfrow = c(2,2))
plot(original_lm)

According to the plot, the residuals is not equally distributed around the 0 horizontal line. The residual does not follow the normal distribution. From the Q-Q plot, the upper end of the Q-Q plot to deviate from the straight line and the lower and follows a straight line then the curve has a longer till to its right and it is right-skewed. There are few influential points outside the Cook’s distance from the residual vs leverage plot.

Use Stepwise regression to find the reduced model

step_lm = step(original_lm ,direction = "backward")
FALSE Start:  AIC=6491
FALSE total_case_count ~ total_n_shooting + borough + borough_case_count + 
FALSE     borough_n_shooting + borough_n_victim
FALSE 
FALSE                      Df Sum of Sq       RSS    AIC
FALSE - borough_n_victim    1    217636  90590906 6490.3
FALSE <none>                             90373271 6491.0
FALSE - borough_n_shooting  1    818080  91191350 6493.8
FALSE - total_n_shooting    1   2021870  92395141 6500.9
FALSE - borough             4  39043008 129416279 6676.2
FALSE - borough_case_count  1 588595414 678968685 7573.9
FALSE 
FALSE Step:  AIC=6490.3
FALSE total_case_count ~ total_n_shooting + borough + borough_case_count + 
FALSE     borough_n_shooting
FALSE 
FALSE                      Df Sum of Sq       RSS    AIC
FALSE <none>                             90590906 6490.3
FALSE - borough_n_shooting  1    778718  91369625 6492.9
FALSE - total_n_shooting    1   1937594  92528501 6499.7
FALSE - borough             4  38876916 129467822 6674.4
FALSE - borough_case_count  1 589871555 680462461 7573.1
broom::tidy(step_lm) %>% 
  knitr::kable()
term estimate std.error statistic p.value
(Intercept) 339.348098 53.8805130 6.298160 0.0000000
total_n_shooting -22.309721 6.6262386 -3.366876 0.0008154
boroughBrooklyn -482.383723 49.9081326 -9.665433 0.0000000
boroughManhattan 184.832789 55.4381320 3.334037 0.0009158
boroughQueens -360.004827 54.9689274 -6.549242 0.0000000
boroughStaten Island 478.634546 85.4554940 5.600980 0.0000000
borough_case_count 3.870615 0.0658879 58.745467 0.0000000
borough_n_shooting 38.446997 18.0126077 2.134449 0.0332642

Since the p-value of borough_n_shooting is greater than 0.05, hence it is insignificant.

set.seed(100)
par(mfrow = c(2,2))
plot(step_lm)

According to the plot, there is nmo obvious difference between reduced model and original model. From the Residuals vs Leverage plot, we could observed that the spread of standardized residuals shouldn’t change as a function of leverage: here it appears to decrease, indicating heteroskedasticity. Secondly, points with high leverage may be influential, and there are some point outside the Cook’s distance dotted line, which would have high influence.

shooting_covid %>% 
  add_residuals(step_lm) %>% 
  add_predictions(step_lm) %>% 
  ggplot(aes(x = pred, y = resid)) + 
  geom_point(alpha = 0.5) +
  xlab("Fitted Values") +
  ylab("Residuals") +
  ggtitle("Residuals Against Fitted Values plot") +
  geom_abline(intercept = 0, slope = 0, color = "red")

From the plot, we observed that it have some outliers with residuals greater than 2000. It clustered around the lower single digits of the y = 0 and looks like a nonconstant variance plot.

Model Comparasion

cross validation and the plot for RMSE

cv = crossv_mc(shooting_covid, 100) %>% 
    mutate(
    train = map(train, as_tibble),
    test = map(test, as_tibble)
  ) %>% 
  mutate(
    model1 = map(train, ~lm(total_case_count ~ total_n_shooting + borough + borough_case_count + 
    borough_n_shooting + borough_n_victim, data = .x)),
    model2 = map(train, ~lm(total_case_count ~ total_n_shooting + borough + borough_case_count + 
    borough_n_shooting, data = .x))) %>% 
  mutate(
    rmse_model1 = map2_dbl(model1, test, ~rmse(model = .x, data = .y)),
    rmse_model2 = map2_dbl(model2, test, ~rmse(model = .x, data = .y))
    )

cv %>% 
  select(starts_with("rmse")) %>% 
  pivot_longer(
    everything(),
    names_to = "model",
    values_to = "rmse",
    names_prefix = "rmse_" 
  ) %>% 
  ggplot(aes(x = model, y = rmse)) +  geom_violin(fill = "orange",alpha = 0.5) +
  geom_boxplot(alpha = 0.5, color = "white") 

According to the plot, there is no obvious different between RMSE. Model 2 have slightly lower average value of RMSE, hence would be fits better than model 1. The two models are similar since we only drop one parameter for model 2.

Regression Conclusion

Model 2 seems perform a little bit better than the first model.Cross validation shows a little higher rmse for model2 than model1. Disability, smoking, unemployment rate positively associate with the firearm crude death rate. Law strength negatively associate with crude death rate.

Disscussion

Main Findings

  • Based on the time based heatmap, we can see a significant characteristics of the time the shooting cases occur: mostly occurred during the midnight till early in the morning, with an obvious increase on Saturday and Sunday.
  • The year 2006 to 2019 has witnessed a gradually decreasing of number of shooting cases with some drastic decline on certain year. However, in year 2020, shooting cases increased dramatically to over 1800 cases. This may be related to COVID-19.
  • The month June, July and August have more shooting incidences occurred compared to the rest of the year, which has contradicted our original thought of massive amounts of shootings may happen at the end of the year.
  • Overall, Bronx and Brooklyn have more shooting cases.
  • Most of both the perpetrators and victims are in the age range 18-24 and 25-44, and is Black or White Hispanic, male.
  • For each borough except Staten Island, there is a general trend that higher number of shooting cases is associated with fewer COVID cases.